This report explores a dataset containing quality ratings and 11 chemical properties for approximately 1,600 wines. Below is a summary of the dataset. We have transformed the label ‘quality’ into an ordered factor, to facilitate later plotting.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
summary(wine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The following section displays 10 univariate plots, which will help to understand the structure of the individual variables in the wine dataset.
We observe that the majority of wines are rated 5 and 6 with a maximum rating of 8. The purpose of this analysis is to identify which chemical properties (features) influence the quality of wines (label).
The distribution of residual sugar shows multiple outliers, with values above 8. The mode appears to be around the region of 2.
I have limited the observations of residual sugar to a shorter interval to better see what was happening around the peak of count observed around 2.0.
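The report's plotting code is not shown, so the following is only a minimal base-R sketch of this kind of zoom. It uses the first ten residual sugar values from the str() output above; the cutoff of 4 (`upper_limit`) and the bin width are assumptions chosen for illustration.

```r
# First ten residual.sugar values, copied from the str() output above
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Hypothetical cutoff: keep values at or below 4 to zoom in on the peak near 2
upper_limit <- 4
zoomed <- residual_sugar[residual_sugar <= upper_limit]

# Narrow bins make the shape around the mode easier to read
hist(zoomed, breaks = seq(0, upper_limit, by = 0.25),
     main = "Residual sugar (zoomed)", xlab = "residual.sugar")
```

Filtering the values drops the outliers from the plot entirely, so the counts shown only describe the wines inside the chosen interval.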
Likewise for chlorides, we can observe a long-tailed distribution, with a mode at 0.08.
We can observe that the total sulfur dioxide distribution is positively skewed.
We can observe that the mode of pH is in the region of 3.3, which is acidic.
We can observe that the sulphate distribution is positively skewed, with a mode around 0.6.
What is the structure of your dataset?
There are 1,599 wines in the dataset, with 11 features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The variable X is just an index number and will not be used for this analysis.
All variables are numerical.
Some observations:
What is/are the main feature(s) of interest in your dataset?
The main features are the acidic properties of the wines (pH, fixed acidity, volatile acidity, citric acid) and alcohol properties of the wines. We will examine how those features can predict the label of wine quality.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Other features, such as residual sugar and sulfur dioxide, will help examine the relationship between chemical properties and wine quality.
Did you create any new variables from existing variables in the dataset?
I created the new variable ‘other.sulfur.dioxide’, defined as the difference between the total sulfur dioxide and the free sulfur dioxide.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I converted the label ‘quality’ from a numerical type to an ordered factor.
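The two transformations described above can be sketched as follows. The data frame is a small stand-in built from the first six rows of the head() output; the level range 3 to 8 is taken from the minimum and maximum quality in summary(wine). The report's own code for this step is not shown, so treat this as one plausible way to do it.

```r
# First six rows of the relevant columns, copied from the head() output above
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Sulfur dioxide that is not free: total minus free
wine$other.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Ordered factor over the observed rating range 3..8 (from summary(wine))
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine)
```

Declaring all six levels explicitly keeps the factor consistent even when a subset of the data does not contain every rating.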
This section describes the relationships between 2 variables. It starts by plotting a pair plot, which will help to visualize correlations between variables.
Using the visualization from the pair plot, I propose to investigate further the correlation between:
Selecting the largest correlation coefficients from the correlation matrix, I propose to study in more detail the correlations between:
We can observe that the interquartile ranges vary between wines of different quality. This should be explored further with an ordinal logistic regression later.
We can also observe that citric acid levels vary from wines of poor quality to wines of high quality. We will investigate further later.
We can observe that high quality wines (7 and 8) have a higher alcohol content, generally above 11%.
We can observe a moderate correlation between fixed acidity and citric acid, with an R^2 value of 0.451.
We can observe a moderate correlation between volatile acidity and citric acid, with an R^2 value of 0.305.
We can observe a moderate correlation between the log of fixed acidity and pH, with an R^2 value of 0.499.
We can observe a lot of dispersion, around a general trendline. Let’s smooth the data by calculating the mean pH for values of fixed acidity.
We can observe a good correlation between the log of fixed acidity and the mean pH, with an R^2 value of 0.903.
This is expected, as pH is the negative log of the H+ ion concentration (pH = -log10[H+]), which increases with the concentration of fixed acids.
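The smoothing step, averaging pH for each value of fixed acidity and then fitting a line against the log of fixed acidity, can be sketched as below. The ten rows come from the str() output above, so the resulting R^2 will not match the 0.903 obtained on the full dataset. The use of `log10` is an assumption; the report does not say which base was used, and the R^2 of a linear fit is the same for any base.

```r
# First ten rows of fixed.acidity and pH, copied from the str() output above
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Mean pH for each distinct value of fixed acidity
ph_means <- aggregate(pH ~ fixed.acidity, data = wine, FUN = mean)

# Linear fit of mean pH against log10(fixed acidity)
fit <- lm(pH ~ log10(fixed.acidity), data = ph_means)
r_squared <- summary(fit)$r.squared
```

Averaging removes the scatter within each group, so a fit on the means will generally report a higher R^2 than a fit on the raw points. The higher value describes how well the trend line tracks the group means, not how well it predicts individual wines.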
We can observe a weak correlation between citric acid and pH, with an R^2 value of 0.294.
We can observe a lot of dispersion around a general trendline. Let’s smooth the data by calculating the mean pH for values of citric acid.
We can observe a good correlation between citric acid and the mean pH, with an R^2 value of 0.731.
We can observe a good correlation between total sulfur dioxide and free sulfur dioxide, with an R^2 value of 0.799.
This is expected, as free sulfur dioxide is a component of total sulfur dioxide, so the two variables are collinear.
We can observe a moderate correlation between fixed acidity and density, with an R^2 value of 0.446.
We can observe a weak correlation between density and alcohol, with an R^2 value of 0.246.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Using the visualization from the pair plot, we can observe good correlation between:
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
The scatter plots and simple linear models allowed us to observe good correlation between the following features:
The correlations above are expected because these chemical properties depend on each other.
The scatter plots and simple linear models allowed us to observe moderate correlation between the following features:
The scatter plots and simple linear models allowed us to observe weak correlation between the following features:
As such, if we were to use them in a prediction model, we should be cautious about mistakenly showing spurious correlations.
What was the strongest relationship you found?
The strongest relationship for the quality of wine was found with the volatile acidity.
In this section, we plot scatterplots between features that have shown strong relationships previously and add a third variable, the main label of interest (quality).
We can observe on the previous plots that wines of good quality (7 and 8, in green) are generally clustered separately from wines of low quality (3 and 4, in red and orange).
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The relationships between the label quality and both volatile acidity and alcohol were strengthened.
Were there any interesting or surprising interactions between features?
The relationship between density and fixed acidity showed an interaction.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
We created an ordinal logistic regression model using the polr function from the MASS package. The purpose of the model is to predict the wine quality using the features available in the dataset.
The relationship studied was:
quality ~ volatile.acidity + alcohol + fixed.acidity + density + citric.acid + sulphates
We split the dataset into training and test sets. We then created several models by adding features one by one, in order to study the influence of those features on our ability to predict the wine quality. The metric used to evaluate the performance of the models was accuracy.
We also included a model that uses all features as predictors.
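library(MASS)   # polr()
library(dplyr)  # sample_frac()
# Splitting test and train sets
set.seed(1000)
wine_train <- sample_frac(wine, 0.75)
# Independent second draw: wine_test may share rows with wine_train
wine_test <- sample_frac(wine, 0.25)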
# Models adding one variable at the time
m1 <- polr(quality ~ volatile.acidity, data = wine_train)
m2 <- update(m1, ~ . + alcohol)
m3 <- update(m2, ~ . + density)
m4 <- update(m3, ~ . + fixed.acidity)
m5 <- update(m4, ~ . + sulphates)
m6 <- update(m5, ~ . + citric.acid)
# Model with all variables
m7 <- polr(quality ~ volatile.acidity + alcohol + density + fixed.acidity +
sulphates + citric.acid + residual.sugar + chlorides +
free.sulfur.dioxide + other.sulfur.dioxide + pH,
data = wine_train)
summary(m7)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = quality ~ volatile.acidity + alcohol + density +
## fixed.acidity + sulphates + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + other.sulfur.dioxide +
## pH, data = wine_train)
##
## Coefficients:
## Value Std. Error t value
## volatile.acidity -3.405290 0.468934 -7.2618
## alcohol 0.911891 0.069009 13.2142
## density 2.526928 1.139739 2.2171
## fixed.acidity 0.035362 0.059677 0.5926
## sulphates 2.739864 0.400287 6.8447
## citric.acid -0.817503 0.542019 -1.5083
## residual.sugar 0.059375 0.044569 1.3322
## chlorides -4.668701 1.483980 -3.1461
## free.sulfur.dioxide -0.001046 0.006263 -0.1670
## other.sulfur.dioxide -0.011570 0.002696 -4.2917
## pH -1.842257 0.583175 -3.1590
##
## Intercepts:
## Value Std. Error t value
## 3|4 -0.6275 1.1673 -0.5375
## 4|5 1.2790 1.1653 1.0976
## 5|6 5.0865 1.1662 4.3618
## 6|7 7.8978 1.1773 6.7082
## 7|8 10.8068 1.2058 8.9621
##
## Residual Deviance: 2302.12
## AIC: 2334.12
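The split above draws wine_train and wine_test with two separate sample_frac calls, so the same wine can end up in both sets, which tends to make test accuracy look better than it would on unseen data. A disjoint split can be sketched with row indices in base R. The toy data frame and its `id` column are hypothetical and only serve to make the example runnable; this is an alternative sketch, not the code that produced the results in this report.

```r
# Toy stand-in for the wine data frame (hypothetical values; 20 rows)
wine <- data.frame(id = 1:20, alcohol = seq(9, by = 0.2, length.out = 20))

set.seed(1000)
n <- nrow(wine)

# Draw 75% of the row indices for training
train_idx <- sample(n, size = round(0.75 * n))
wine_train <- wine[train_idx, ]

# Everything not drawn for training goes to the test set
wine_test <- wine[-train_idx, ]

# Number of rows present in both sets (zero for a disjoint split)
overlap <- length(intersect(wine_train$id, wine_test$id))
```

Refitting the models on a split like this would change the accuracy figures, so the numbers reported in this section apply only to the original sampling.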
results <- data.frame(wine_test$quality)
colnames(results)[1] <- 'actual'
results$predicted.1 <- ordered(predict(m1, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
results$predicted.2 <- ordered(predict(m2, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
results$predicted.3 <- ordered(predict(m3, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
results$predicted.4 <- ordered(predict(m4, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
results$predicted.5 <- ordered(predict(m5, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
results$predicted.6 <- ordered(predict(m6, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
results$predicted.7 <- ordered(predict(m7, newdata = wine_test),
levels = c(3, 4, 5, 6, 7, 8))
The following shows the actual quality and the quality rating predicted by the 7 models we created.
head(results)
## actual predicted.1 predicted.2 predicted.3 predicted.4 predicted.5
## 1 6 5 5 5 5 5
## 2 4 6 5 5 5 5
## 3 6 6 6 6 6 6
## 4 5 5 5 6 5 5
## 5 6 5 5 5 5 5
## 6 6 5 6 6 6 6
## predicted.6 predicted.7
## 1 6 6
## 2 5 5
## 3 6 6
## 4 5 5
## 5 5 5
## 6 6 6
The following shows the accuracy score for each of the 7 models we created.
misClasificError <- c(mean(results$predicted.1 != results$actual),
mean(results$predicted.2 != results$actual),
mean(results$predicted.3 != results$actual),
mean(results$predicted.4 != results$actual),
mean(results$predicted.5 != results$actual),
mean(results$predicted.6 != results$actual),
mean(results$predicted.7 != results$actual))
print(1-misClasificError)
## [1] 0.4775 0.5725 0.5650 0.5450 0.5700 0.5725 0.5800
We can observe that the performance of the models is fairly weak, with accuracy scores between 0.48 and 0.58: at best, we predict the exact quality rating about 58% of the time. This is still higher than uniform random guessing at about 16.7%, as there are 6 possible quality ratings to predict per wine.
We can observe that the largest gains in accuracy are obtained by adding alcohol (model m2, from 0.4775 to 0.5725) and sulphates (model m5, from 0.5450 to 0.5700). The highest accuracy is achieved by including all variables (model m7), with a score of 0.58.
Model m2, which uses only volatile acidity and alcohol, comes close to this score with 0.5725.
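Because most wines are rated 5 or 6, uniform guessing is a lenient reference point. As a supplementary check, accuracy can also be compared with a majority-class baseline that always predicts the most frequent rating. The sketch below uses only the six rows of the results table printed above (actual ratings and the predictions of model m7), so its numbers illustrate the calculation and say nothing about the full test set. The names `baseline` and `confusion` are introduced here for illustration.

```r
# First six rows of the results table shown above (actual vs. model m7)
results <- data.frame(
  actual      = c(6, 4, 6, 5, 6, 6),
  predicted.7 = c(6, 5, 6, 5, 5, 6)
)

# Accuracy: share of exact matches between prediction and actual rating
accuracy <- mean(results$predicted.7 == results$actual)

# Majority-class baseline: always predict the most frequent actual rating
class_counts <- table(results$actual)
baseline <- max(class_counts) / sum(class_counts)

# Confusion table: rows are actual ratings, columns are predicted ratings
confusion <- table(actual = results$actual, predicted = results$predicted.7)
```

The confusion table also shows whether errors fall on neighbouring ratings, which matters for an ordinal label where predicting 5 for a 6 is less serious than predicting 3.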
The limitations of this exercise are:
This plot shows the distribution of volatile acidity for each quality rating. We can visualize with the position of the boxplots that the volatile acidity tends to be lower for higher quality rated wines. Only wines rated 7 and 8 have similar distributions for volatile acidity. This makes the volatile acidity the best candidate for a primary predictor of red wine quality.
This plot shows by quality rating, how the wines are distributed in terms of volatile acidity and citric acid concentration.
We can observe that the green points (high quality wines with a rating of 7 or 8) tend to be clustered around high citric acid concentration (0.25 to 0.75 g/L) and low volatile acidity (0.2 to 0.6 g/L). Low quality wines tend to be clustered around low citric acid concentration (below 0.25 g/L) and high volatile acidity (above 0.6).
This clustering indicates that the combination of volatile acidity and citric acid concentration would make good predictors for wine quality.
This plot shows by quality rating, how the wines are distributed in terms of volatile acidity and alcohol content.
We can observe that the green points (high quality wines with a rating of 7 or 8) tend to be clustered around high alcohol content (10 % and above) and low volatile acidity (0.2 to 0.6 g/L). Low quality wines tend to be clustered around low alcohol content (below 10 %) and high volatile acidity (above 0.6).
This clustering indicates that the combination of volatile acidity and alcohol content would make good predictors for wine quality.
In this analysis, there were around 1,600 observations with 11 variables to consider. We started by plotting univariate plots to understand the data structure. Then we looked at relationships between the label under study (quality) and the other features. We isolated the pairs that displayed the strongest relationships on a pair plot. We explored further by plotting bivariate plots and ultimately multivariate plots. We finally built a quick prediction model based on an ordinal logistic regression.
The analysis revealed that volatile acidity and alcohol content were good predictors of the wine quality. The volatile acidity tends to be lower for higher quality rated wines (below 0.6 g/L). This is to be expected, as volatile acidity consists largely of acetic acid (vinegar), which is associated with an unpleasant taste.
The analysis confirms the influence of alcohol content on the wine quality. This is to be expected, as tasters would expect a red wine to have a stronger body.
The influence of fixed acidity was not clearly demonstrated by the model (m4). This is not surprising, as citric acid - a desired acid for its softer acid taste and pleasant aromatic properties - is only one of its many constituents.
It was surprising to see that chlorides and sulfur dioxides did not appear to be good predictors. Naturally, one would think that those chemical properties, associated with negative perception (taste or health), would be good indicators of lower wine quality.
Overall, acceptable prediction of the wine quality can already be achieved with model m2, which uses only 2 variables: volatile acidity and alcohol.
There are several axes of further exploration and improvement in the prediction by deep-diving: